{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Started\n",
    "\n",
    "This notebook provides a quick introduction to using **ParquetDB**, a lightweight file-based database system that leverages the Parquet format and PyArrow under the hood.\n",
    "\n",
    "If you have not installed ParquetDB yet, you can do so via pip:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install parquetdb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a Database\n",
    "Initialize a ParquetDB instance by specifying the dataset name. In practice, you would also specify a path where Parquet files will be stored. If no path is provided, a default directory may be used (e.g., current working directory)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "============================================================\n",
      "PARQUETDB SUMMARY\n",
      "============================================================\n",
      "Database path: ParquetDB\n",
      "\n",
      "• Number of columns: 1\n",
      "• Number of rows: 0\n",
      "• Number of files: 1\n",
      "• Number of rows per file: [0]\n",
      "• Number of row groups per file: [1]\n",
      "• Serialized metadata size per file: [312] Bytes\n",
      "\n",
      "############################################################\n",
      "METADATA\n",
      "############################################################\n",
      "\n",
      "############################################################\n",
      "COLUMN DETAILS\n",
      "############################################################\n",
      "• Columns:\n",
      "    - id\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import shutil\n",
    "from parquetdb import ParquetDB\n",
    "\n",
    "\n",
    "db_path = \"ParquetDB\"\n",
    "if os.path.exists(db_path):\n",
    "    shutil.rmtree(db_path)\n",
    "\n",
    "db = ParquetDB(db_path=db_path)\n",
    "print(db)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adding Data\n",
    "You can add new records to your ParquetDB instance using `db.create()`. The data can be:\n",
    "- A dictionary of field-value pairs\n",
    "- A list of dictionaries\n",
    "- A Pandas DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data added successfully!\n",
      "============================================================\n",
      "PARQUETDB SUMMARY\n",
      "============================================================\n",
      "Database path: ParquetDB\n",
      "\n",
      "• Number of columns: 4\n",
      "• Number of rows: 2\n",
      "• Number of files: 1\n",
      "• Number of rows per file: [2]\n",
      "• Number of row groups per file: [1]\n",
      "• Serialized metadata size per file: [896] Bytes\n",
      "\n",
      "############################################################\n",
      "METADATA\n",
      "############################################################\n",
      "\n",
      "############################################################\n",
      "COLUMN DETAILS\n",
      "############################################################\n",
      "• Columns:\n",
      "    - age\n",
      "    - occupation\n",
      "    - name\n",
      "    - id\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data = [\n",
    "    {\"name\": \"Charlie\", \"age\": 28, \"occupation\": \"Designer\"},\n",
    "    {\"name\": \"Diana\", \"age\": 32, \"occupation\": \"Product Manager\"},\n",
    "]\n",
    "\n",
    "db.create(data)\n",
    "\n",
    "print(\"Data added successfully!\")\n",
    "print(db)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ParquetDB can handle schema evolution and handle nested data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "============================================================\n",
      "PARQUETDB SUMMARY\n",
      "============================================================\n",
      "Database path: ParquetDB\n",
      "\n",
      "• Number of columns: 6\n",
      "• Number of rows: 3\n",
      "• Number of files: 2\n",
      "• Number of rows per file: [2, 1]\n",
      "• Number of row groups per file: [1, 1]\n",
      "• Serialized metadata size per file: [1244, 1206] Bytes\n",
      "\n",
      "############################################################\n",
      "METADATA\n",
      "############################################################\n",
      "\n",
      "############################################################\n",
      "COLUMN DETAILS\n",
      "############################################################\n",
      "• Columns:\n",
      "    - id\n",
      "    - field1.subfield2\n",
      "    - name\n",
      "    - occupation\n",
      "    - age\n",
      "    - field1.subfield1\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data = [\n",
    "    {\"name\": \"Jimmy\", \"field1\": {\"subfield1\": \"value1\", \"subfield2\": \"value2\"}},\n",
    "]\n",
    "\n",
    "db.create(data)\n",
    "\n",
    "print(db)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whenever nested data is added, ParquetDB will flatten the data. This is because read and write operations can be performed much quicker on flattened data. ParquetDB also has an option to recover the nested structure during read operations.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading Data\n",
    "The `db.read()` method retrieves data from the ParquetDB. By default, the data is returned as a PyArrow Table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "All Employees:\n",
      "<class 'pyarrow.lib.Table'>\n",
      "pyarrow.Table\n",
      "age: int64\n",
      "field1.subfield1: string\n",
      "field1.subfield2: string\n",
      "id: int64\n",
      "name: string\n",
      "occupation: string\n",
      "----\n",
      "age: [[28,32],[null]]\n",
      "field1.subfield1: [[null,null],[\"value1\"]]\n",
      "field1.subfield2: [[null,null],[\"value2\"]]\n",
      "id: [[0,1],[2]]\n",
      "name: [[\"Charlie\",\"Diana\"],[\"Jimmy\"]]\n",
      "occupation: [[\"Designer\",\"Product Manager\"],[null]]\n"
     ]
    }
   ],
   "source": [
    "# Read all data\n",
    "table = db.read()\n",
    "print(\"All Employees:\")\n",
    "print(type(table))\n",
    "print(table)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can transform this data to a Pandas DataFrame if you prefer. Here, I am setting the `split_blocks` to `True` and `self_destruct` to `True` to avoid allocated unncessary memory. You can learn more about Pyarrow [Pandas Integration here](https://arrow.apache.org/docs/python/pandas.html#memory-usage-and-zero-copy)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = table.to_pandas(split_blocks=True, self_destruct=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can optionally specify filters and columns to read only what you need. This functionality is powered by PyArrow and Parquet’s metadata, enabling efficient predicate pushdown. To learn more about accepted PyArrow filters [PyArrow documentation](https://arrow.apache.org/docs/python/compute.html#filtering-by-expressions).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Just the Names:\n",
      "      name\n",
      "0  Charlie\n",
      "1    Diana\n",
      "2    Jimmy\n",
      "\n",
      "Employees older than 30:\n",
      "   age field1.subfield1 field1.subfield2  id   name       occupation\n",
      "0   32             None             None   1  Diana  Product Manager\n"
     ]
    }
   ],
   "source": [
    "# Read specific columns\n",
    "df = db.read(columns=[\"name\"]).to_pandas()\n",
    "print(\"\\nJust the Names:\")\n",
    "print(df)\n",
    "\n",
    "# Read data with filters\n",
    "from pyarrow import compute as pc\n",
    "\n",
    "age_filter = pc.field(\"age\") > 30\n",
    "df = db.read(filters=[age_filter]).to_pandas()\n",
    "print(\"\\nEmployees older than 30:\")\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Updating Data\n",
    "You can update records by calling `db.update()` with a list of items. Each item to be updated **must** include its `id` field. You can also add new fields here as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Records updated successfully!\n",
      "   age field1.subfield1 field1.subfield2  id     name       occupation\n",
      "0   28             None             None   0  Charlie         Designer\n",
      "1   32           value8           value9   1    Diana  Senior Engineer\n",
      "2   50           value1           value2   2    Jimmy  Senior Engineer\n"
     ]
    }
   ],
   "source": [
    "update_data = [\n",
    "    {\n",
    "        \"id\": 1,\n",
    "        \"age\": 32,\n",
    "        \"occupation\": \"Senior Engineer\",\n",
    "        \"field1\": {\"subfield1\": \"value8\", \"subfield2\": \"value9\"},\n",
    "    },\n",
    "    {\n",
    "        \"id\": 2,\n",
    "        \"age\": 50,\n",
    "        \"occupation\": \"Senior Engineer\",\n",
    "    },\n",
    "]\n",
    "\n",
    "\n",
    "db.update(update_data)\n",
    "\n",
    "\n",
    "print(\"Records updated successfully!\")\n",
    "\n",
    "df = db.read().to_pandas()\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deleting Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Delete by id\n",
    "\n",
    "Remove specific records from the database by specifying their `id` values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Records deleted successfully!\n",
      "   age field1.subfield1 field1.subfield2  id     name       occupation\n",
      "0   28             None             None   0  Charlie         Designer\n",
      "1   32           value8           value9   1    Diana  Senior Engineer\n"
     ]
    }
   ],
   "source": [
    "db.delete(ids=[2])\n",
    "print(\"Records deleted successfully!\")\n",
    "\n",
    "df = db.read().to_pandas()\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Delete by columns\n",
    "You can also delete columns by specifying the column names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Columns deleted successfully!\n",
      "   age field1.subfield2  id     name       occupation\n",
      "0   28             None   0  Charlie         Designer\n",
      "1   32           value9   1    Diana  Senior Engineer\n"
     ]
    }
   ],
   "source": [
    "db.delete(columns=[\"field1.subfield1\"])\n",
    "print(\"Columns deleted successfully!\")\n",
    "\n",
    "df = db.read().to_pandas()\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Delete by filters\n",
    "\n",
    "You can also delete records by specifying filters.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Records deleted successfully!\n",
      "   age field1.subfield2  id     name occupation\n",
      "0   28             None   0  Charlie   Designer\n"
     ]
    }
   ],
   "source": [
    "db.delete(filters=[pc.field(\"age\") > 30])\n",
    "print(\"Records deleted successfully!\")\n",
    "\n",
    "df = db.read().to_pandas()\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transform Data\n",
    "\n",
    "Updating is a costly operation to perform on very large datasets. If you know ahead of time you are going to operate on the entire dataset before hand, using the `transform` method is much more efficient. Instead of searching for the ids, it will take the current dataset and apply the transformation to create a new dataset. In this example, I create a new column `age_bin` by binning the `age` from 0 to 100 in increments of 10.\n",
    "\n",
    "From the docs, here is the signature of the `transform` method:\n",
    "\n",
    "\n",
    "\n",
    "```python\n",
    "def transform(\n",
    "        self,\n",
    "        transform_callable: Callable[[pa.Table], pa.Table],\n",
    "        new_db_path: Optional[str] = None,\n",
    "        normalize_config: NormalizeConfig = NormalizeConfig(),\n",
    "    ) -> Optional[\"ParquetDB\"]:\n",
    "        \"\"\"\n",
    "        Transform the entire dataset using a user-provided callable.\n",
    "\n",
    "        This function:\n",
    "        1. Reads the entire dataset as a PyArrow table.\n",
    "        2. Applies the `transform_callable`, which should accept a `pa.Table`\n",
    "            and return another `pa.Table`.\n",
    "        3. Writes out the transformed data:\n",
    "            - Overwrites this ParquetDB in-place (if `new_db_path=None`), or\n",
    "            - Creates a new ParquetDB at `new_db_path` (if `new_db_path!=None`).\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   age field1.subfield2  id     name occupation                      age_bin\n",
      "0   28             None   0  Charlie   Designer  [0, 0, 1, 0, 0, 0, 0, 0, 0]\n"
     ]
    }
   ],
   "source": [
    "import pyarrow as pa\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def binning_age_column(table):\n",
    "    df = table.to_pandas()\n",
    "    df[\"age_bin\"] = df[\"age\"].apply(\n",
    "        lambda x: np.histogram(x, bins=range(0, 100, 10))[0]\n",
    "    )\n",
    "    return pa.Table.from_pandas(df)\n",
    "\n",
    "\n",
    "db.transform(binning_age_column)\n",
    "\n",
    "df = db.read().to_pandas()\n",
    "print(df)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "parquetdb_dev",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.20"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
